## Warning: package 'readr' was built under R version 4.1.3
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
##
## transform
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'DescTools' was built under R version 4.1.3
## Warning: package 'ROSE' was built under R version 4.1.3
## Loaded ROSE 0.0-4
## Warning: package 'car' was built under R version 4.1.3
## Warning: package 'carData' was built under R version 4.1.3
##
## Attaching package: 'car'
## The following object is masked from 'package:DescTools':
##
## Recode
## The following object is masked from 'package:dplyr':
##
## recode
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.2.1 v stringr 1.5.0
## v tidyr 1.3.0 v forcats 0.5.1
## v purrr 1.0.1
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::extract() masks dlookr::extract()
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## x car::recode() masks dplyr::recode()
## x purrr::some() masks car::some()
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
## Warning: package 'GGally' was built under R version 4.1.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Warning: package 'gridExtra' was built under R version 4.1.1
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'caret' was built under R version 4.1.3
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following objects are masked from 'package:DescTools':
##
## MAE, RMSE
## Warning: package 'ROCR' was built under R version 4.1.3
## Warning: package 'outliers' was built under R version 4.1.3
## [1] "outliers" "ROCR" "caret" "lattice" "gridExtra"
## [6] "GGally" "corrplot" "forcats" "stringr" "purrr"
## [11] "tidyr" "tibble" "tidyverse" "plotly" "car"
## [16] "carData" "ggcorrplot" "ROSE" "DescTools" "RColorBrewer"
## [21] "dplyr" "dlookr" "ggplot2" "readr" "stats"
## [26] "graphics" "grDevices" "utils" "datasets" "methods"
## [31] "base"
We import the CSV file “loan_sample_9.csv” and make a copy of it so that we do not alter the original dataset.
In the first step, we explore the data, starting with the structure of the dataset. It contains 12 numeric and 5 categorical variables. However, the numeric variable “Status”, with its values “0” and “1”, is effectively a factor, and the character variables should likewise be treated as factors.
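The import-and-copy step can be sketched as follows. This is a minimal base-R stand-in (the analysis itself uses readr::read_csv); it writes a tiny two-row CSV first so the snippet runs without the real file, whose columns and values are illustrative only.

```r
# Write a tiny stand-in CSV so the sketch runs without "loan_sample_9.csv";
# with the real file one would read "loan_sample_9.csv" directly
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(loan_amnt = c(6000, 8000), Status = c(0, 0)),
          tmp, row.names = FALSE)

loan_original <- read.csv(tmp)
data <- loan_original  # work on the copy; loan_original stays untouched
```

Because `data` starts as an exact copy, any cleaning or transformation applied later leaves `loan_original` available for comparison.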
## # A tibble: 6 x 17
## loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6000 18.2 D RENT 90000 Not Verified debt_c~
## 2 8000 13.3 C MORTGAGE 70000 Verified home_i~
## 3 6000 14.0 C MORTGAGE 54000 Source Verified debt_c~
## 4 1500 15.6 D RENT 53000 Not Verified credit~
## 5 7000 10.1 B RENT 65000 Not Verified debt_c~
## 6 5000 12.7 C RENT 37000 Not Verified debt_c~
## # i 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## # revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## # application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## # Status <dbl>
## # A tibble: 6 x 17
## loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 2000 8.18 B RENT 47000 Source Verified credit~
## 2 6000 14.5 C RENT 38000 Source Verified debt_c~
## 3 2500 9.93 B OWN 23000 Not Verified other
## 4 16000 19.0 D RENT 60000 Source Verified debt_c~
## 5 7000 9.17 B RENT 34000 Source Verified small_~
## 6 14400 17.0 D MORTGAGE 110000 Source Verified debt_c~
## # i 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## # revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## # application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## # Status <dbl>
## spc_tbl_ [40,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ loan_amnt : num [1:40000] 6000 8000 6000 1500 7000 ...
## $ int_rate : num [1:40000] 18.2 13.3 14 15.6 10.1 ...
## $ grade : chr [1:40000] "D" "C" "C" "D" ...
## $ home_ownership : chr [1:40000] "RENT" "MORTGAGE" "MORTGAGE" "RENT" ...
## $ annual_inc : num [1:40000] 90000 70000 54000 53000 65000 37000 70000 36000 40000 15000 ...
## $ verification_status: chr [1:40000] "Not Verified" "Verified" "Source Verified" "Not Verified" ...
## $ purpose : chr [1:40000] "debt_consolidation" "home_improvement" "debt_consolidation" "credit_card" ...
## $ dti : num [1:40000] 25.67 6.72 13.16 16.85 2.36 ...
## $ open_acc : num [1:40000] 15 8 9 5 7 6 7 12 8 7 ...
## $ revol_bal : num [1:40000] 10839 690 8057 18382 4124 ...
## $ revol_util : num [1:40000] 28.7 3.4 42.6 85.1 19.3 36 74.1 22.7 60.1 57.4 ...
## $ total_acc : num [1:40000] 28 16 18 18 10 9 7 17 15 10 ...
## $ total_rec_int : num [1:40000] 1153 705 1088 338 142 ...
## $ application_type : chr [1:40000] "Individual" "Individual" "Individual" "Individual" ...
## $ tot_cur_bal : num [1:40000] 90776 199277 148632 23795 4124 ...
## $ total_rev_hi_lim : num [1:40000] 37745 20400 18900 21600 21400 ...
## $ Status : num [1:40000] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. loan_amnt = col_double(),
## .. int_rate = col_double(),
## .. grade = col_character(),
## .. home_ownership = col_character(),
## .. annual_inc = col_double(),
## .. verification_status = col_character(),
## .. purpose = col_character(),
## .. dti = col_double(),
## .. open_acc = col_double(),
## .. revol_bal = col_double(),
## .. revol_util = col_double(),
## .. total_acc = col_double(),
## .. total_rec_int = col_double(),
## .. application_type = col_character(),
## .. tot_cur_bal = col_double(),
## .. total_rev_hi_lim = col_double(),
## .. Status = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We check the presence of NAs in each of the variables included in the dataset. There are no NAs values in this dataset.
knitr::kable(apply(data, 2, function(x) any(is.na(x))))
| x | |
|---|---|
| loan_amnt | FALSE |
| int_rate | FALSE |
| grade | FALSE |
| home_ownership | FALSE |
| annual_inc | FALSE |
| verification_status | FALSE |
| purpose | FALSE |
| dti | FALSE |
| open_acc | FALSE |
| revol_bal | FALSE |
| revol_util | FALSE |
| total_acc | FALSE |
| total_rec_int | FALSE |
| application_type | FALSE |
| tot_cur_bal | FALSE |
| total_rev_hi_lim | FALSE |
| Status | FALSE |
Now we have 12 numeric and 5 character variables.
overview <- overview(data)
plot(overview)
***
We convert the character variables to factors so that we can count and order their categories.
data$grade = as.factor(data$grade)
data$home_ownership = as.factor(data$home_ownership)
data$verification_status = as.factor(data$verification_status)
data$purpose = as.factor(data$purpose)
data$application_type = as.factor(data$application_type)
data$Status = as.factor(data$Status)
data <- data %>%
  select(where(is.numeric), where(is.factor))  # numeric columns first, factors last
overview <- overview(data)
plot(overview)
***
In most numerical variables there is a large gap between the minimum and maximum.
For example, “loan_amnt” (amount of the loan applied for by the borrower) has a minimum of 1,000 and a maximum of 40,000, and “revol_bal” (total credit revolving balance) ranges from USD 0 to USD 78,762.
The average interest rate “int_rate” is around 12.63%, with values between 5.31% and 27.49%.
The annual income “annual_inc” of borrowers varies greatly, with an average of around USD 63,277.
There are outliers with very high annual salaries.
There are borrowers with a dti of 0, which could indicate low indebtedness.
The Variable “purpose” (category provided by the borrower for the loan request) has many categories. They contain the name of the type of loan, except for one group. This group is labeled as “other” and contains 2,283 values. Most loans are used for debt consolidation and credit cards.
Most borrowers are graded “B” or “C”; grades “A” and “D” contain similar numbers of people. The variable “grade” is the loan grade assigned by the financial service provider.
Most borrowers rent or hold a mortgage on their home; only 3,982 own their home outright.
Of the 40,000 borrowers, 14,278 are not verified and 16,129 are source verified.
Only 530 of the 40,000 applications are joint applications (“Joint App”); the rest are individual applications.
The target variable “Status” is unbalanced, as there are more loans without default (status 0 = 34,794 persons) than with default (status 1 = 5,206).
summary(data)
## loan_amnt int_rate annual_inc dti
## Min. : 1000 Min. : 5.31 Min. : 6600 Min. : 0.00
## 1st Qu.: 7000 1st Qu.: 9.44 1st Qu.: 42000 1st Qu.:12.17
## Median :10050 Median :12.29 Median : 57000 Median :17.67
## Mean :11682 Mean :12.63 Mean : 63277 Mean :18.24
## 3rd Qu.:15125 3rd Qu.:15.05 3rd Qu.: 77000 3rd Qu.:23.89
## Max. :40000 Max. :27.49 Max. :400000 Max. :60.14
##
## open_acc revol_bal revol_util total_acc
## Min. : 1.00 Min. : 0 Min. : 0.00 Min. : 3.00
## 1st Qu.: 8.00 1st Qu.: 5619 1st Qu.: 34.80 1st Qu.:15.00
## Median :10.00 Median : 9760 Median : 52.50 Median :20.00
## Mean :10.29 Mean :11948 Mean : 52.24 Mean :21.27
## 3rd Qu.:13.00 3rd Qu.:15792 3rd Qu.: 70.00 3rd Qu.:27.00
## Max. :23.00 Max. :78762 Max. :123.20 Max. :57.00
##
## total_rec_int tot_cur_bal total_rev_hi_lim grade home_ownership
## Min. : 0.0 Min. : 0 Min. : 400 A: 7274 MORTGAGE:17736
## 1st Qu.: 680.2 1st Qu.: 25136 1st Qu.: 12998 B:13263 OWN : 3982
## Median :1345.5 Median : 53821 Median : 20700 C:11807 RENT :18282
## Mean :1820.6 Mean : 99208 Mean : 24089 D: 7656
## 3rd Qu.:2433.9 3rd Qu.:158638 3rd Qu.: 32000
## Max. :8834.9 Max. :472573 Max. :100000
##
## verification_status purpose application_type
## Not Verified :14278 debt_consolidation:23414 Individual:39470
## Source Verified:16129 credit_card : 9362 Joint App : 530
## Verified : 9593 other : 2283
## home_improvement : 2095
## major_purchase : 807
## medical : 445
## (Other) : 1594
## Status
## 0:34794
## 1: 5206
##
##
##
##
##
In the next step, we investigate our target variable “Status”. As noted above, 34,794 persons did not default on their loan, while 5,206 did default.
As we can see in the visualization the data set is highly imbalanced.
ggplot(data, aes(x = Status, fill = Status)) +
geom_bar() +
ylab("Count") +
xlab("Status of the loan")
PercTable(data$Status)
##
## freq perc
##
## 0 34'794 87.0%
## 1 5'206 13.0%
The distributions are all right-skewed; only the variable “dti” is approximately bell-shaped.
Almost all numerical variables show many outliers in the visualization; they are most pronounced for “annual_inc”.
The variable “total_acc” has a moderate number of outliers, while “open_acc” and “revol_util” have few.
Here we see the box plots side by side in one visualization.
This view confirms the results of the previous visualization.
- The variable “loan_amnt” has a moderate number of outliers, which could indicate that there are loans with unusually high amounts.
- The variable “int_rate” shows some outliers with higher interest rates, which may indicate special loan conditions.
- The variable “annual_inc” shows some outliers with very high annual salaries, which could indicate individuals with exceptionally high incomes.
- The variable “dti” shows only a few outliers, which could indicate unusually high debt-to-income ratios for some applicants.
- The variable “open_acc” shows some outliers, which could indicate borrowers with an unusually high number of open credit accounts.
- The variable “revol_bal” shows many outliers with high revolving balance amounts, which could indicate borrowers with large credit card balances.
- The variable “revol_util” has only one outlier. This outlier could indicate an unusually high revolving utilization rate for a borrower.
- The variable “total_acc” shows some outliers, which could indicate borrowers with an unusually large number of total credit accounts.
- The variable “total_rec_int” has many outliers with high total interest payments, which could indicate special loan conditions or exceptionally high interest rates, as we have already seen with the variable “int_rate”.
- The variable “tot_cur_bal” shows some outliers with high total balances, which could indicate borrowers with significant loan account balances.
- The variable “total_rev_hi_lim” shows many outliers with high total credit limits, which could indicate borrowers with large credit limits.
knitr::kable(diagnose_outlier(data), caption = "Diagnose Outlier", digits = 2)
| variables | outliers_cnt | outliers_ratio | outliers_mean | with_mean | without_mean |
|---|---|---|---|---|---|
| loan_amnt | 814 | 2.04 | 29886.58 | 11682.24 | 11304.09 |
| int_rate | 724 | 1.81 | 25.06 | 12.63 | 12.40 |
| annual_inc | 1392 | 3.48 | 161791.42 | 63276.84 | 59724.93 |
| dti | 90 | 0.22 | 47.05 | 18.24 | 18.18 |
| open_acc | 335 | 0.84 | 21.41 | 10.29 | 10.20 |
| revol_bal | 1745 | 4.36 | 39407.30 | 11948.31 | 10695.77 |
| revol_util | 1 | 0.00 | 123.20 | 52.24 | 52.24 |
| total_acc | 284 | 0.71 | 48.65 | 21.27 | 21.08 |
| total_rec_int | 2292 | 5.73 | 6460.17 | 1820.57 | 1538.57 |
| tot_cur_bal | 788 | 1.97 | 396840.36 | 99208.04 | 93226.85 |
| total_rev_hi_lim | 983 | 2.46 | 70119.58 | 24089.05 | 22929.35 |
We note that for the variables “annual_inc”, “revol_bal” and “total_rev_hi_lim” the visualization changes considerably once the outliers are treated, and the median also tends to shift.
The distribution of the variable “loan_amnt” changes somewhat more; the other variables change little.
To deal with the most extreme values, we cap them at the 5% and 95% quantiles (a winsorizing approach) rather than applying the interquartile range (IQR) rule.
outlier <- function(x, trim = 0.05) {
  # Cap values below the 5% quantile and above the 95% quantile
  q <- quantile(x, c(trim, 1 - trim), na.rm = TRUE)
  x[x < q[1]] <- q[1]
  x[x > q[2]] <- q[2]
  return(x)
}
# Cap all numeric columns; keep the factor columns (12:17) unchanged
data_new_under <- map_df(data[, -c(12:17)], outlier)
cols <- data[, c(12:17)]
data_new_under <- cbind(data_new_under, cols)
boxplot(scale(data_new_under[, c(1:11)]), use.cols = TRUE)
***
knitr::kable(diagnose_outlier(data_new_under), caption = "Diagnose Outlier", digits = 2)
| variables | outliers_cnt | outliers_ratio | outliers_mean | with_mean | without_mean |
|---|---|---|---|---|---|
| loan_amnt | 0 | 0.00 | NaN | 11578.02 | 11578.02 |
| int_rate | 0 | 0.00 | NaN | 12.52 | 12.52 |
| annual_inc | 0 | 0.00 | NaN | 62030.81 | 62030.81 |
| dti | 0 | 0.00 | NaN | 18.19 | 18.19 |
| open_acc | 0 | 0.00 | NaN | 10.24 | 10.24 |
| revol_bal | 0 | 0.00 | NaN | 11571.70 | 11571.70 |
| revol_util | 0 | 0.00 | NaN | 52.32 | 52.32 |
| total_acc | 0 | 0.00 | NaN | 21.12 | 21.12 |
| total_rec_int | 2292 | 5.73 | 5303.03 | 1758.40 | 1542.95 |
| tot_cur_bal | 0 | 0.00 | NaN | 96715.11 | 96715.11 |
| total_rev_hi_lim | 0 | 0.00 | NaN | 23687.76 | 23687.76 |
winsorize <- function(x, trim = 0.05) {
  # Same quantile capping as outlier() above
  q <- quantile(x, c(trim, 1 - trim), na.rm = TRUE)
  x[x < q[1]] <- q[1]
  x[x > q[2]] <- q[2]
  return(x)
}
# Apply winsorizing to 'total_rec_int'
data_new_under2 <- data_new_under %>%
  mutate(total_rec_int_winsorized = winsorize(total_rec_int))
# Compare before/after
summary(data_new_under$total_rec_int)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 157.8 680.2 1345.5 1758.4 2433.9 5319.9
summary(data_new_under2$total_rec_int_winsorized)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 157.8 680.2 1345.5 1758.4 2433.9 5319.9
diagnose_outlier(data_new_under2)
## variables outliers_cnt outliers_ratio outliers_mean
## 1 loan_amnt 0 0.00 NaN
## 2 int_rate 0 0.00 NaN
## 3 annual_inc 0 0.00 NaN
## 4 dti 0 0.00 NaN
## 5 open_acc 0 0.00 NaN
## 6 revol_bal 0 0.00 NaN
## 7 revol_util 0 0.00 NaN
## 8 total_acc 0 0.00 NaN
## 9 total_rec_int 2292 5.73 5303.029
## 10 tot_cur_bal 0 0.00 NaN
## 11 total_rev_hi_lim 0 0.00 NaN
## 12 total_rec_int_winsorized 2292 5.73 5303.017
## with_mean without_mean
## 1 11578.01937 11578.01937
## 2 12.52322 12.52322
## 3 62030.81008 62030.81008
## 4 18.18800 18.18800
## 5 10.23893 10.23893
## 6 11571.69927 11571.69927
## 7 52.31697 52.31697
## 8 21.11825 21.11825
## 9 1758.39980 1542.94710
## 10 96715.11485 96715.11485
## 11 23687.76467 23687.76467
## 12 1758.39916 1542.94715
ggpairs(data,
        mapping = aes(color = Status),
        columns = c("loan_amnt", "int_rate", "annual_inc", "dti",
                    "total_acc", "total_rec_int", "tot_cur_bal"))
In the next step, we carry out undersampling and visualize the class distribution again.
set.seed(7)
data_original <- data
data_balanced <- ovun.sample(Status ~ ., data=data, method = "under")
data_under <- data_balanced$data
Visualization of the levels of the target variable:
Balancing the dataset is necessary so that the model is not influenced by an excessive presence of one class over the other. If the dataset is not balanced, models tend to focus on the dominant class and ignore the minority class. Balancing ensures that both classes are equally represented, which can lead to a more robust model.
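The effect of undersampling can be sketched in base R; this is a minimal stand-in for ovun.sample, using toy class counts that mirror the sample's 87/13 split (the counts themselves are illustrative).

```r
set.seed(7)
# Toy imbalanced target: 870 non-defaults (0) vs 130 defaults (1)
status <- factor(c(rep(0, 870), rep(1, 130)))

# Undersample the majority class down to the minority class size
minority_n <- min(table(status))
keep <- c(sample(which(status == "0"), minority_n), which(status == "1"))
balanced <- status[keep]

table(balanced)  # both classes now have 130 observations
```

In the actual analysis, `ovun.sample(Status ~ ., data = data, method = "under")` performs the same majority-class downsampling while keeping all predictor columns.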
set.seed(7)
div <- createDataPartition(y = data_new_under$Status, p = 0.7, list = F)
# Training Sample
data.train <- data_new_under[div,] # 70% here
# Test Sample
data.test <- data_new_under[-div,] # rest of the 30% data goes here
fit1 <- glm(Status ~ ., data=data.train,family=binomial())
summary(fit1)
##
## Call:
## glm(formula = Status ~ ., family = binomial(), data = data.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2946 -0.5662 -0.4447 -0.3126 2.8068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.675e+00 2.576e-01 -18.151 < 2e-16 ***
## loan_amnt 6.065e-05 4.826e-06 12.567 < 2e-16 ***
## int_rate 1.478e-01 1.424e-02 10.381 < 2e-16 ***
## annual_inc -4.440e-06 1.019e-06 -4.359 1.31e-05 ***
## dti 1.820e-02 2.795e-03 6.512 7.42e-11 ***
## open_acc 3.617e-02 7.362e-03 4.913 8.97e-07 ***
## revol_bal -3.509e-06 6.918e-06 -0.507 0.612015
## revol_util 1.184e-03 1.575e-03 0.752 0.452298
## total_acc -9.083e-03 3.075e-03 -2.954 0.003134 **
## total_rec_int -2.538e-04 1.839e-05 -13.802 < 2e-16 ***
## tot_cur_bal -8.456e-07 3.157e-07 -2.679 0.007392 **
## total_rev_hi_lim -3.337e-06 3.713e-06 -0.899 0.368774
## gradeB 3.272e-01 9.146e-02 3.578 0.000347 ***
## gradeC 3.428e-01 1.212e-01 2.828 0.004691 **
## gradeD 3.102e-01 1.733e-01 1.789 0.073550 .
## home_ownershipOWN 2.462e-02 6.963e-02 0.354 0.723687
## home_ownershipRENT 1.794e-01 5.173e-02 3.467 0.000526 ***
## verification_statusSource Verified 8.555e-02 4.460e-02 1.918 0.055112 .
## verification_statusVerified 9.229e-02 5.005e-02 1.844 0.065215 .
## purposecredit_card -7.194e-02 1.906e-01 -0.378 0.705782
## purposedebt_consolidation -3.807e-02 1.874e-01 -0.203 0.839036
## purposehome_improvement 1.538e-01 2.036e-01 0.755 0.450042
## purposehouse -2.329e-01 3.272e-01 -0.712 0.476490
## purposemajor_purchase -8.040e-03 2.275e-01 -0.035 0.971814
## purposemedical 4.404e-01 2.362e-01 1.865 0.062245 .
## purposemoving -3.275e-01 2.803e-01 -1.168 0.242748
## purposeother -2.291e-02 1.991e-01 -0.115 0.908396
## purposerenewable_energy 4.274e-01 5.137e-01 0.832 0.405366
## purposesmall_business 5.067e-01 2.422e-01 2.092 0.036448 *
## purposevacation 1.864e-01 2.632e-01 0.708 0.478890
## purposewedding 2.764e-02 5.086e-01 0.054 0.956657
## application_typeJoint App 2.282e-01 1.383e-01 1.650 0.098931 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21657 on 28000 degrees of freedom
## Residual deviance: 20117 on 27969 degrees of freedom
## AIC: 20181
##
## Number of Fisher Scoring iterations: 5
From the results obtained, we can deduce the following points.

- Deviance residuals: the moderate range of the deviance residuals (-1.29 to 2.81) suggests a reasonable model fit. Smaller residuals would indicate a more precise fit, but these values are acceptable.
- Coefficients: positive coefficients for “loan_amnt” and “int_rate” indicate that higher loan amounts and interest rates are associated with a higher probability of belonging to the positive class.
- Significance: statistically significant predictors include “loan_amnt”, “int_rate”, “annual_inc”, “total_rec_int”, “gradeB” and “gradeC”, crucial characteristics that influence the model.
- Null and residual deviance: the reduction from a null deviance of 21,657 to a residual deviance of 20,117 suggests that the model explains some of the variability of the response variable.
- AIC: the AIC value of 20,181 is an indicator of model fit and complexity. Although it could be lower, it is still a reasonable value considering the number of predictors.
We should also conduct cross-validation to ensure the generalizability of the model. In summary, the model shows promise with significant predictors, but there is room for improvement; further analysis and refinement can improve its predictive capabilities and overall performance.
data.test$fit1_score <- predict(fit1, newdata = data.test, type = "response")
fit1_pred <- prediction(data.test$fit1_score, data.test$Status)
fit1_roc <- performance(fit1_pred, "tpr", "fpr")
plot(fit1_roc, lwd=1, colorize = TRUE, main = "Fit1: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)
Through the ROC (Receiver Operating Characteristic) curve, we can evaluate the performance of the classification algorithm. There are several aspects that we can identify and comment on.
The curve shows the relationship between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for different thresholds. The true positive rate is shown on the y-axis and the false positive rate on the x-axis. The diagonal line represents a random classifier; a good classifier lies above this line, meaning it achieves a higher true positive rate than false positive rate across thresholds. On the right, the color scale represents the threshold at which the corresponding rates are reached: red areas correspond to higher thresholds and blue areas to lower ones.
In our case the curve appears to be well above the diagonal, indicating a better classifier than a random guess. The color scale can be useful to see how thresholds affect evaluation metrics.
fit1_precision <- performance(fit1_pred, measure = "prec", x.measure = "rec")
plot(fit1_precision, main="Fit1: Logit - Precision vs Recall")
With the precision-recall curve we can evaluate the classification model and see how the trade-off behaves when the classes are unequally distributed. The axes are:

- Recall (x-axis): the proportion of actual positive cases that were recognized as positive, i.e. how many of the true positives the model correctly identified.
- Precision (y-axis): the proportion of truly positive instances among those classified as positive.

The curve shows the trade-off between precision and recall for different thresholds. A perfect classifier would produce a curve in the top-right corner of the graph, where both precision and recall are 1. In our case, precision starts high when recall is low, meaning the model is very selective when it classifies an instance as positive. As recall increases (the model tries to capture more true positive cases), precision decreases. This is a typical trade-off, as it is often difficult to achieve high precision and high recall at the same time. Since the curve rises towards the right corner, comparatively high precision and recall can be reached together.
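Both metrics follow directly from a confusion table; the counts below are illustrative, not taken from the fitted model:

```r
# Illustrative confusion-table counts (not from fit1)
TP <- 40  # true positives
FP <- 10  # false positives
FN <- 20  # false negatives

precision <- TP / (TP + FP)  # 40 / 50 = 0.8
recall    <- TP / (TP + FN)  # 40 / 60 ~ 0.667
```

Sweeping the classification threshold changes TP, FP and FN, which traces out the precision-recall curve plotted above.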
confusionMatrix(as.factor(round(data.test$fit1_score)), data.test$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10431 1551
## 1 7 10
##
## Accuracy : 0.8702
## 95% CI : (0.864, 0.8761)
## No Information Rate : 0.8699
## P-Value [Acc > NIR] : 0.4743
##
## Kappa : 0.0099
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.999329
## Specificity : 0.006406
## Pos Pred Value : 0.870556
## Neg Pred Value : 0.588235
## Prevalence : 0.869906
## Detection Rate : 0.869322
## Detection Prevalence : 0.998583
## Balanced Accuracy : 0.502868
##
## 'Positive' Class : 0
##
From the confusion matrix and statistics we can deduce the following points.

- Accuracy: the model correctly classifies 87.02% of instances, but this is barely above the no-information rate of 86.99% (p-value 0.47), so it hardly beats always predicting the majority class.
- Sensitivity and specificity: sensitivity for the positive class “0” is 99.93%, while specificity is only 0.64%; almost all defaults (class “1”) are misclassified.
- Kappa statistic: the Kappa value of 0.0099 indicates almost no agreement beyond random chance.
- Positive predictive value (precision): precision is 87.06%, essentially equal to the prevalence of class “0” (86.99%).
- Balanced accuracy: at 50.29%, the model performs no better than chance across the two classes.
- McNemar's test: the p-value below 2e-16 shows a strongly significant asymmetry in the errors, as the model almost exclusively predicts class “0”.

In conclusion, the imbalanced training data leads the model to ignore the minority class, so there is clear room for improvement.
fit1_auc <- performance(fit1_pred, measure = "auc")
cat("AUC: ",fit1_auc@y.values[[1]]*100)
## AUC: 69.1709
The AUC value of 69.17 falls into the fair discrimination range. While it suggests some ability of our model to distinguish between the two classes, there is definitely room for improvement. It could be valuable to compare our AUC value to that of other models to gain further context regarding our model’s performance.
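For intuition, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. It can be computed rank-based on toy scores (the values below are illustrative, not from fit1):

```r
# Toy scores and labels (illustrative, not from the fitted model)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 0, 0, 0)

# Rank-based AUC (Wilcoxon/Mann-Whitney statistic)
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
auc  # 14 of the 15 positive-negative pairs are ranked correctly
```

ROCR's `performance(pred, measure = "auc")` computes the same quantity from the full score vector.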
Loan utilization ratio:
This ratio could indicate how much of the available loan has already been used.
summary(data$loan_amnt / data$total_rev_hi_lim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01562 0.32692 0.52356 0.66191 0.80279 80.00000
Income to loan ratio:
This could indicate how well the borrower is able to repay the loan, based on their income.
summary(data$annual_inc / data$loan_amnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.264 3.667 5.322 7.385 8.304 152.000
Risk categories for debt-to-income (DTI):
A new categorical variable that categorizes debt-to-income (dti) into different risk categories. For example, low, medium and high risk based on certain thresholds.
summary(cut(data$dti, breaks = c(-Inf, 15, 25, Inf), labels = c("Low Risk", "Medium Risk", "High Risk")))
## Low Risk Medium Risk High Risk
## 15121 16355 8524
What challenges in making credit decisions would a company face if it were to use our model in its day-to-day business? These challenges are captured in the four common ethical issues in the context of creating value from data:
Privacy and data security * Collecting and using various financial and personal variables (e.g., “loan_amnt”, “int_rate”, “annual_inc”) for credit decisions requires a strong privacy framework for used customer data. Ensuring encryption, secure storage and compliance with privacy regulations are critical, considering the sensitive nature of financial information.
Algorithmic bias and fairness * The model coefficients reveal that some variables, such as “grade B” and “grade C”, have a significant impact on the predictions. It is essential to carefully examine these variables for possible biases, ensuring that credit decisions are fair and impartial across different grades and demographic groups.
Accountability and responsibility * Model performance parameters, including accuracy and sensitivity, provide a basis for evaluating its effectiveness. Establishing accountability for model results is critical, especially with significant predictors like “loan_amnt” and “int_rate.” Transparent communication about how decisions are made is essential for accountability.
Impact on the workforce * Implementing the credit decision model may impact the workforce involved in manual credit assessments. Workforce implications, including potential job role changes, should be considered. Ethical considerations involve transparent communication about these changes and efforts to mitigate any negative impacts on the workforce.
In conclusion, while the logistic regression model shows promise in predicting credit decisions, addressing ethical issues requires a comprehensive approach. Ensure rigorous data privacy measures, continually evaluate and mitigate algorithmic bias, establish accountability for model results, and consider social impact on the workforce. Engaging in ongoing ethical discussions and staying attuned to the implications of model decisions will contribute to responsible and ethical implementation in daily business operations.
Companies can overcome or mitigate the problems and difficulties described above, associated with implementing predictive models for credit decisions, in the following ways:
Data Privacy & Security: Implement Robust Security Measures * Employ encryption and secure storage protocols to protect sensitive data. Adopt anonymization and aggregation techniques to minimize the exposure of individual details. Ensure compliance with data protection regulations and obtain explicit consent from individuals for data usage.
Algorithmic Bias & Fairness: Continuous Monitoring and Fairness Audits * Regularly monitor and assess model predictions for biases. Conduct fairness audits, particularly focusing on variables with significant impact. Adjust the model as needed to ensure fairness across different demographic groups.
Accountability & Responsibility: Establish Clear Accountability and Transparency * Clearly define roles and responsibilities for individuals involved in model development and deployment. Maintain transparent documentation of the model’s decision-making process. Establish mechanisms for accountability and redress in case of errors or unintended consequences.
Impact on the Workforce: Responsible Workforce Management * Provide training and upskilling opportunities for employees affected by automation. Communicate transparently about changes in job roles or responsibilities. Consider the societal impact and contribute to initiatives that support workforce development in the face of technological advancements.
By adopting these strategies, companies can navigate the ethical challenges associated with deploying predictive models for credit decisions, fostering responsible and transparent practices in their daily business operations. Regular reassessment and adaptation to evolving ethical standards and regulations are essential for continued ethical performance.